ASCII Phonetic Symbols for the World's Languages: Worldbet

Authors

  • James L. Hieronymus
  • George Allen
  • Ian Maddieson
  • John Wells
Abstract

A new ASCII encoding of the International Phonetic Alphabet (IPA) and additional symbols has been designed for all languages. Many of the previous ASCII versions were targeted at European languages and therefore left out many of the sounds of other languages, or used symbols for unusual sounds, like clicks, for plosive bursts. When an attempt was made to label a large number of languages with phonemic and phonetic symbols, these were found to be inadequate. The present scheme builds on earlier work by George Allen, Ian Maddieson, John Wells, Laver et al., and Hieronymus et al. Wherever possible, the present scheme was made similar to the base IPA symbols, so that many of the symbols will seem to have obvious meanings. Many of the symbols are the same as in other schemes. The underlying principle is that any spectrally and temporally distinct speech sound (not including pitch) which is phonemic in some language should have a separate base symbol. In most cases the base symbol consists of a concatenation of an IPA symbol and diacritics. Thus it is easy to recognize the phonemic base symbols and compare the same broad phonetic sound across languages. Tone languages have diacritics applied to the vowel phoneme symbols to properly identify the phonemes in these languages. Allophonic variations due to contextual coarticulation and stress may be labelled by a diacritic attached to the base symbol. It is possible that some speech sounds which are phonemic in at least one of the world's languages are missing from the present version. It is hoped that any oversights will be corrected in subsequent versions of Worldbet, and a standard method for constructing new symbols is presented.

Introduction

Many systems have been developed for writing the sounds of the world's languages. Many of the early workers made their own systems because there was no agreed standard, or indeed knowledge of the complete speech sound inventory. The International Phonetic Alphabet was developed and then revised several times into its present form. It represents years of experience with putting a symbol to each sound in all of the known languages in the world. The issues of economy of representation and the distinction between allophonic variation and true baseform sound have been worked out for many more languages since the IPA was originally formulated. Therefore it is a good place to begin for any multi-language speech database labelling effort.

There are some sounds which are not normally included in the IPA but which have been found to be useful in labelling large speech corpora like TIMIT, SCRIBE, BDSONS, and PHONDAT. These modern attempts at a standard ASCII form of the IPA resulted in TIMITBET, MRPA, SAMPA, and SAMPA Extended, to name a few. These phonetic alphabets were restricted to English or to European languages and thus were too narrow in scope to be used in other major language families. The issue is whether or not the ASCII representation is consistent, complete, and logical for all of the IPA symbols.

Worldbet is an attempt to have a phonetic alphabet which covers all of the world's languages in a systematic fashion. It is an ASCII version of the IPA plus a number of symbols which were found useful in database labelling and which are not currently in the official IPA set. This list of extra symbols may grow with time until all of the important phenomena have a coherent symbol representation.

This paper is organized to first cover the general principles of Worldbet, discuss earlier labeling sets, give specific symbol assignments, and discuss labeling methods. Appendix A is an exhaustive list of the Worldbet symbols and their corresponding labels in a few other systems, namely TIMITBET, SAMPA, and JBET (a phonetic alphabet used in speech synthesis). Appendix B is a table of place and manner of articulation vs. Worldbet symbols. Appendix C gives examples of Worldbet symbol inventories for several languages.

General Principles

Worldbet is an ASCII version of the International Phonetic Alphabet (IPA) with additional broad phonetic symbols not presently in the IPA. It is designed for a large set of languages, including Indian, Asian, African, and European languages. Consideration of the special sounds in each of these languages leads to the principle that each base symbol will represent a speech sound with a spectrally distinct time sequence. Each type of "r" will have its separate IPA-like designation, rather than the more graphemic "r" used in some label sets. Allophones like aspirated plosives will have a separate base symbol from unaspirated plosives if they are phonemic within the language in question; otherwise they will be marked using the base symbol plus a diacritic. "Distinct" means so different spectrally or temporally as to be perceptually different when the components are heard in isolation.

Vowels are classed into nominal place positions. It is recognized that the detailed vowel color may vary between languages for the same nominal vowel, yet separate symbols will be assigned only when the differences are large enough to constitute different phonemes. In actual labeling experience, it has been found that most of the differences in phonetic labels between trained phoneticians were due to disagreements on the detailed vowel color rather than the actual broad vowel color. Therefore Worldbet base symbols will represent phonemic distinctions in some language, as in the plosive example. The base symbols are thus meant to be broad phonetic, but can be used as surface phonemic symbols within a given language, as stated in the original principles of the IPA. Since the IPA has been in use for many years and has been actively developed and evolved over this period, it should have all of the phonemic distinctions observed in the world's languages to date. Therefore it is the natural starting point for any attempt to construct a phoneme set which is sufficient to cover all of the languages in the world.

Diacritics are used in general to modify the base symbols to deal with allophones which are due to coarticulation effects (e.g., labialized "s" in the environment of "w") or phonological context effects. The diacritic allows the particular allophone to be marked, with the phonemically based broad phone from which the allophone originates as its base character. Of course, it is not always easy to determine what is an allophonic variation and what is a change of broad phonetic category. Normally the number of symbols used to label a particular language will be limited, to avoid an overly large label inventory.

The motivation for Worldbet is to label speech for corpus-driven speech research, phonological inventories, automatic language identification, multi-language speech recognition, and multi-language speech synthesis. It should also be useful in constructing multilingual dictionaries. In all of the above uses, it is most convenient for each sound labeled with a particular symbol to closely resemble all other sounds with the same label, no matter in which language it is uttered.

Previous Label Sets

Past work on ASCII-to-IPA symbol sets was reviewed, including the Klatt phonbet (Allen et al.); the PHONASCII system (Allen); Arpabet, used in the first ARPA Speech Understanding Project; TIMITBET, used in labelling the DARPA Acoustic-Phonetic Database, which was collected by Texas Instruments and labelled by MIT; the Esprit Speech Assessment Methodology Phonetic Alphabet (SAMPA); the Edinburgh Machine Readable Phonetic Alphabet (MRPA); the Alvey Project phonetic alphabet; and the SCRIBE project phonetic label set (SAMPA Extended). These were generally concerned with one or a few Indo-European languages and thus are missing a number of the symbols needed for other languages. For SAMPA, some simplifying assumptions were made because it was thought that it would be used for transcription within one language, not across languages. This leads to the same symbol being used for quite different sounds, most notably for "r".

An effort for worldwide ASCII-to-IPA coverage by Ian Maddieson of UCLA was thought to be too complicated for the present application. It is a more detailed label set aimed at fine phonetic distinctions in all the world's languages. It does not distinguish between diacritics and baseforms. With the full set of diacritics in Worldbet, it should be possible to have the same level of detail, with the proviso that phonemes with multiple places of articulation might have to have baseform symbols assembled using the Worldbet linking character.

A new ASCII version of the IPA, which has been developed on the sci.lang email news group, has also been examined, but seems to suffer from too few languages being considered in detail. It is intended for use in email discussions of phonetics and phonology. It is a full encoding of the IPA and has some common symbols with...
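
The base-symbol-plus-diacritic principle described under General Principles can be illustrated with a minimal sketch. The "_" separator, the `Label` class, and the example symbols below are hypothetical choices made only for illustration; the actual Worldbet notation is defined in the paper's symbol tables, which are not reproduced in this excerpt.

```python
from dataclasses import dataclass
from typing import Tuple

# Assumption for illustration only: diacritics are joined to the base
# symbol with "_". This is NOT claimed to be the real Worldbet notation.
SEP = "_"


@dataclass(frozen=True)
class Label:
    """A phonetic label: one base symbol plus optional diacritics."""
    base: str                          # broad phonetic (phonemic) base symbol
    diacritics: Tuple[str, ...] = ()   # allophonic detail, e.g. labialization

    def __str__(self) -> str:
        # Rebuild the flat ASCII form from its parts.
        return SEP.join((self.base, *self.diacritics))


def parse_label(text: str) -> Label:
    """Split a flat ASCII label into its base symbol and diacritics."""
    base, *marks = text.split(SEP)
    if not base:
        raise ValueError(f"label has no base symbol: {text!r}")
    return Label(base, tuple(marks))


def same_broad_sound(a: str, b: str) -> bool:
    """True when two labels share a base symbol, ignoring diacritics."""
    return parse_label(a).base == parse_label(b).base
```

Under these assumptions, a labialized "s" written as the hypothetical label `s_w` parses to base `s` with diacritic `w`, so `same_broad_sound("s_w", "s")` holds. Keeping the base symbol separable from its diacritics is what allows the same broad phonetic sound to be compared across languages.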

Similar resources

ASCII Based Transcription Systems for Languages with the Arabic Script: The Case of Persian

In this paper, we discuss transcription systems needed for automated spoken language processing applications in languages such as Persian that use the Arabic script for writing. The work is described in the context of a speech-to-speech translation system development for English and Persian. This system can easily be modified for Arabic, Dari, Urdu and any other language that uses the Arabic sc...

A Transcription Scheme For Languages Employing The Arabic Script Motivated By Speech Processing Applications

This paper offers a transcription system for Persian, the target language in the Transonics project, a speech-to-speech translation system developed as a part of the DARPA Babylon program (The DARPA Babylon Program; Narayanan, 2003). In this paper, we discuss transcription systems needed for automated spoken language processing applications in Persian that uses the Arabic script for writing. Th...

A Transcription Scheme for Languages Employing the Arabic Script Motivated by Speech Processing Application

This paper offers a transcription system for Persian, the target language in the Transonics project, a speech-to-speech translation system developed as a part of the DARPA Babylon program (The DARPA Babylon Program; Narayanan, 2003). In this paper, we discuss transcription systems needed for automated spoken language processing applications in Persian that uses the Arabic script for writing. Th...

The Effect of Using Phonetic Websites on Iranian EFL Learners’ Word Level Pronunciation

Computer-assisted language learning (CALL) is reaching an uppermost position in the pedagogical field of English as a Second or Foreign Language (ESL/EFL). The present study was carried out to study the effect of using phonetic websites on Iranian EFL students’ pronunciation and knowledge of phonemic symbols. Participants of the study included 30 EFL female pre-intermediate students studyin...

Om: One tool for many (Indian) languages

Many different languages are spoken in India, each language being the mother tongue of tens of millions of people. While the languages and scripts are distinct from each other, the grammar and the alphabet are similar to a large extent. One common feature is that all the Indian languages are phonetic in nature. In this paper we describe the development of a transliteration scheme Om which explo...

Publication date: 1993